Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)
Related Notebooks:
The heart of 16S microbiome analysis is the assignment of each sample's sequences to the microbial "species" (technically "operational taxonomic units", or OTUs) from which they originated. This step is known as "OTU picking", and its results are strongly affected by the picking method used. The available picking methods are differentiated primarily by what set of 16S sequences, if any, is used as the reference defining the OTUs to which new sequences can be assigned; they are:
de novo picking, in which no reference set is used and reads are clustered only against one another;
closed-reference picking, in which reads are clustered only against a reference sequence set, and reads that do not match any reference sequence are discarded; and
open-reference picking, in which reads are first clustered against a reference set and any reads that do not match are then clustered de novo.
More details on these approaches can be found at http://qiime.org/tutorials/otu_picking.html?highlight=otu%20picking . Following the QIIME recommendations, we prefer open-reference picking whenever the customer has expressed no explicit preference otherwise (which is likely to be most of the time!).
Also, unless otherwise specified by the customer, the Greengenes set of 16S gene sequences is our preferred reference. Because this set is widely used, its current version is included as part of the QIIME install. The path to the fasta file containing the Greengenes 16S sequences is listed in the QIIME config, and can be seen in the output of the command
print_qiime_config.py
Look in the QIIME config values for the line beginning with assign_taxonomy_reference_seqs_fp, which will look something like
assign_taxonomy_reference_seqs_fp: /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
The file-path shown here will be used as input to the OTU picking process.
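If you want to capture this path programmatically rather than copying it by hand, a minimal sketch like the following will do. It assumes only that the config output uses "key: value" lines; the config_text variable is a stand-in for the real print_qiime_config.py output, and get_config_value is a hypothetical helper, not part of QIIME.

```python
# Sketch: pull the Greengenes reference path out of print_qiime_config.py
# output. ASSUMPTION: the output uses "key: value" lines; config_text below
# is a stand-in for the real command's output.
def get_config_value(config_text, key):
    """Return the value for the given config key, or None if absent."""
    for line in config_text.splitlines():
        if line.startswith(key + ":"):
            return line.split(":", 1)[1].strip()
    return None

config_text = ("assign_taxonomy_reference_seqs_fp: "
               "/usr/local/lib/python2.7/dist-packages/qiime_default_reference/"
               "gg_13_8_otus/rep_set/97_otus.fasta")
print(get_config_value(config_text, "assign_taxonomy_reference_seqs_fp"))
```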
Again, in QIIME, OTU picking can be accomplished with a single command-line call--but this one usually takes a while to run, and therefore benefits greatly from being run in parallel. The -a switch tells QIIME to use parallel processing, and the number given with the -O switch tells it how many processors to use. It is best not to use ALL the CPUs on your cluster, but to leave at least one free for other necessary processing; for example, with a 3-node c3.2xlarge cluster, which has 3*8=24 CPUs, you might specify 22 or 23 CPUs.
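As a quick sanity check on that arithmetic, the choice of -O value can be sketched as follows; the node and CPU counts simply mirror the example cluster above, and parallel_jobs is a hypothetical helper for illustration.

```python
# Sketch: pick a value for QIIME's -O switch, leaving at least one CPU
# free for other work on the cluster.
def parallel_jobs(nodes, cpus_per_node, reserve=1):
    """Number of parallel jobs to request, never less than 1."""
    total_cpus = nodes * cpus_per_node
    return max(1, total_cpus - reserve)

# 3-node c3.2xlarge cluster: 3 * 8 = 24 CPUs, so request at most 23 jobs
print(parallel_jobs(3, 8))  # -> 23
```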
The full command has the format
pick_open_reference_otus.py -a -O [number of CPUs] -i [sequences file path] -r [reference file path] -o [output directory path]
and an example looks like
pick_open_reference_otus.py -a -O 23 -i /data/library_split/seqs.fna -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -o /data/open_ref_output/
I have seen this step take from 20 minutes to 2 hours, depending on data size, on the three-node cluster described above.
The config file on the QIIME AMI used for this cluster routes temporary files to the /data directory, so during the run you may see various files appear there; these have various prefixes and extensions that look like .e[numbers] or .o[numbers] (ex: ALIGN_9P6_0.o49, POTU_VuwL_13.e14, etc.). The "e" files are errors from individual jobs, while the "o" files are logs from individual jobs. QIIME is supposed to clean all of these up on completion, but I have rarely seen it do so. However, aside from clearing them out when your job is done, you don't need to give them any particular attention, as the results are summarized in the log_[datetime].txt file. Note that many of the "errors" at particular steps are dealt with successfully at later steps, so the presence of error files doesn't mean the run failed. To check that it hasn't, it is a good idea to skim through the log to ensure that you don't see any text listed in the "Stderr:" fields for each logged command.
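If skimming a long log by eye is tedious, a rough helper along these lines can flag the "Stderr:" fields that actually contain text. The log layout assumed here (a bare "Stderr:" line followed by any error text) is an assumption based on the description above, so verify it against a real log before relying on it.

```python
# Sketch: flag non-empty "Stderr:" fields in a QIIME log. ASSUMPTION: each
# logged command's stderr appears on the line(s) immediately following a
# bare "Stderr:" marker; check this against a real log_[datetime].txt.
def nonempty_stderr_fields(log_lines):
    """Return any text found immediately after a 'Stderr:' marker."""
    problems = []
    for i, line in enumerate(log_lines):
        if line.strip() == "Stderr:":
            j = i + 1
            while j < len(log_lines) and log_lines[j].strip():
                problems.append(log_lines[j].strip())
                j += 1
    return problems

sample_log = ["Stdout:", "", "Stderr:", "",
              "Stderr:", "Traceback (most recent call last):", ""]
print(nonempty_stderr_fields(sample_log))  # -> ['Traceback (most recent call last):']
```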
The pick_open_reference_otus.py command generates a lot of output; a high-level overview is given on the generated index.html page, and details are available at http://qiime.org/scripts/pick_open_reference_otus.html?highlight=pick_open_reference_otus , but only a few of the outputs are used directly in subsequent steps. These are:
otu_table_mc2_w_tax_no_pynast_failures.biom: The OTU table in biom format; basically, this is a very large, usually very sparse table listing the number of reads for each identified OTU in each sample. This particular version of the table excludes OTUs with fewer than 2 sequences as well as sequences that fail to align with PyNAST, and includes OTU taxonomy assignments. It is thus the "cleanest" version and the one you'll want to use going forward.
rep_set.fna: A fasta file of one representative sequence from each identified OTU (note that there are several different ways to choose the representative sequences, and generally it is ok just to use whatever pick_open_reference_otus.py's default is).
rep_set.tre: A phylogenetic tree of the representative sequences for the identified OTUs, describing their inferred evolutionary relationships.
In addition to the all-important OTU table as well as the representative sequence set and its phylogenetic tree, you need one more piece of information from the OTU-picking output--but this one requires a judgment call from you as the analyst. This needed info is the sequencing depth that will be used in the diversity analyses. As stated in the QIIME tutorial at http://nbviewer.ipython.org/github/biocore/qiime/blob/1.9.1/examples/ipynb/illumina_overview_tutorial.ipynb , "Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the counts/sample detail and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away."
You may wonder why we look at the counts/sample information to make this decision NOW, rather than having done so earlier, when we got that information from split_libraries_fastq.py. The reason is that some reads get filtered out during the OTU picking process--possibly even quite a lot of them, for certain data sets and certain picking methods. Therefore, it is necessary to make the sequencing depth decision based on the revised distribution of counts per sample after OTU picking is complete.
To generate a listing of this distribution, run the summarize-table command from the biom software (included as part of the QIIME install) on the otu_table_mc2_w_tax_no_pynast_failures.biom file:
biom summarize-table -i [biom table]
as in this example:
biom summarize-table -i /data/open_ref_output/otu_table_mc2_w_tax_no_pynast_failures.biom
This will print summary information about counts/sample to stdout; if you want it in a more persistent format that is easy to re-sort and re-examine, redirect the output to a file.
The output looks, in part, like this:
Num samples: 399
Num observations: 33667
Total count: 6125023
Table density (fraction of non-zero values): 0.019
Counts/sample summary:
Min: 0.0
Max: 38638.0
Median: 14466.000
Mean: 15350.935
Std. dev.: 7720.364
Sample Metadata Categories: None provided
Observation Metadata Categories: taxonomy
Counts/sample detail:
925.sm1z: 0.0
925.sl1z: 0.0
925.sn2x: 0.0
925.waterfilter.ij.50: 0.0
925.sm4y: 1.0
925.waterfilter.ik.50: 1.0
925.sn4z: 1.0
925.in2z: 1.0
925.in2y: 1.0
925.sm3z: 1.0
925.sl1x: 1.0
925.sl3x: 1.0
925.so2y: 1.0
925.waterfilter.ie.50: 1.0
925.sl4z: 1.0
925.sk5z: 1.0
925.sn5x: 1.0
925.waterfilter.il.50: 1.0
925.sn3x: 1.0
925.io2z: 1.0
925.sl5y: 1.0
925.ia3y: 1.0
925.waterfilter.id.50: 1.0
925.waterfilter.ic.120: 1.0
925.bisonfeces1: 1.0
925.bisonfeces3: 1.0
925.waterfilter.ib.80: 1.0
925.sm1y: 1.0
925.im2z: 1.0
925.so3x: 1.0
925.waterfilter.ih.50: 2.0
925.sn3z: 2.0
925.sk5y: 3.0
925.so3y: 3.0
925.y3ntc.h12: 330.0
925.sg5x: 1383.0
925.if2z: 1580.0
925.y4ntc.h12: 2159.0
925.ie5x: 4966.0
925.ie5y: 5709.0
925.sg5y: 5888.0
925.if5y: 6256.0
925.sf2y: 6644.0
The most useful part of this output is the Counts/sample detail section, which shows an ascending-sorted list of counts per sample. For example, in the output shown above, we know from the Num samples line that there are 399 samples (although most are not shown in this snippet of output), and we see that 4 of them have zero reads, 26 have one read, 2 have two reads, 2 have three reads, 1 has 330 reads, and all the rest have more than 1000 reads.
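To explore candidate depths without hand-counting, one can parse the Counts/sample detail section and tally how many samples fall below each candidate depth. This sketch assumes only the summary layout shown above; summary_text is a trimmed stand-in for a real summary file, and parse_counts_detail is a hypothetical helper, not part of the biom software.

```python
# Sketch: parse the "Counts/sample detail:" section of biom summarize-table
# output and count samples that would be dropped at candidate depths.
# summary_text below is a trimmed stand-in for a real summary file.
def parse_counts_detail(summary_text):
    """Return {sample_id: read_count} from the detail section."""
    counts, in_detail = {}, False
    for line in summary_text.splitlines():
        if line.strip() == "Counts/sample detail:":
            in_detail = True
        elif in_detail and ":" in line:
            sample, value = line.rsplit(":", 1)
            counts[sample.strip()] = float(value)
    return counts

summary_text = """Counts/sample summary:
Min: 0.0
Counts/sample detail:
925.sm1z: 0.0
925.sg5x: 1383.0
925.if2z: 1580.0"""

counts = parse_counts_detail(summary_text)
for depth in (1000, 1383):
    dropped = sum(1 for c in counts.values() if c < depth)
    print("depth %d drops %d sample(s)" % (depth, dropped))
```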
Using this information, you need to decide on the minimum sequencing depth that all samples included in further analyses must have (and thus, which samples to leave out). As noted in the quoted text above, samples that have MORE than the minimum sequencing depth will be included, but only a subset of their reads (up to the minimum sequencing depth) will be used in the further analyses, so you are also deciding how many reads to leave out. If the customer has specified a preferred minimum sequencing depth, that makes the choice easy! However, most customers savvy enough to have done this are also savvy enough to do their own analyses, so you are unlikely to get that information. Still, the customer may have shared some information that could inform your decision:
If no guidance exists, a good rule of thumb is to favor samples over sequences. The reason for this preference is that current research indicates that "low" sequence depths (10s - 1000s of sequences per sample) often capture all but very subtle variations, as shown in panels (a) and (b) of Figure 2 from Kuczynski et al., Direct sequencing of the human microbiome readily reveals community differences, Genome Biology, 2010:
These panels show "(a) The full dataset (approximately 1,500 sequences per sample); (b) the dataset sampled at only 10 sequences per sample, showing the same pattern".
Keep in mind that even with such constraints, some samples may be beyond use. For example, anything with fewer than 15 reads is unlikely to give any usable diversity information. Taking the sequencing depth down to numbers in the 10s or low 100s should be considered very cautiously: only gross differences will be detectable, and the vast majority of read information will be discarded.
Note that you may want to set the minimum sequencing depth to exactly the read count of the least-read-endowed sample you wish to include. For example, given the counts/sample detail shown above, picking a depth of 1000 would indeed lead sample 925.sg5x to be included in the samples analyzed downstream, but would use only 1000 of its 1383 sequences (and only 1000 of each higher-read sample's reads) in all future analyses. By comparison, selecting the maximum depth that still allows the inclusion of sample 925.sg5x, or 1383, ensures that as many reads as possible are used and thus the most statistical power is retained. However, some researchers may nonetheless prefer a nice round number, for easier mental comparison to other experiments.
Once you have decided on the sequencing depth to use, make a note of it. Also gather general information about how this choice affects the sample set (for example, "using a sequencing depth of 1383, 34 of 399 samples, or 8.5%, are removed from further processing") and include it in the report to the customer.
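That report sentence can be generated mechanically from the same per-sample counts; here is a small sketch using hypothetical toy counts (depth_report is an illustrative helper, not a QIIME function).

```python
# Sketch: summarize how a chosen rarefaction depth affects the sample set,
# in the phrasing suggested above. The counts dict is hypothetical toy
# data, not the real 399-sample table.
def depth_report(counts, depth):
    """Return a report-ready sentence about samples lost at this depth."""
    dropped = sum(1 for c in counts.values() if c < depth)
    total = len(counts)
    pct = 100.0 * dropped / total
    return ("using a sequencing depth of %d, %d of %d samples, or %.1f%%, "
            "are removed from further processing" % (depth, dropped, total, pct))

counts = {"s1": 0.0, "s2": 500.0, "s3": 1383.0, "s4": 2000.0}
print(depth_report(counts, 1383))
```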